Probability, likelihood and Bayes

Day 1

Manuele Bazzichetto

Probability

Meaning? Depends on who you ask..

Frequentist: Essentially, the (long-run) relative frequency (or proportion) of an event happening

Bayesian: Essentially, the relative plausibility of an event happening given what we already know about what generates events and what we actually observe (i.e., data)


Which one is best?

NONE

Both are useful

Probability rules

Whether “frequentist” or “Bayesian”, probabilities obey rules:

  • Number bounded between 0 and 1 (i.e., \(0\leq Pr \leq1\))
  • Union (mutually exclusive): \(Pr(A \cup B) = Pr(A) + Pr(B)\), if \(Pr(A \cap B) = 0\)

  • Intersection: \(Pr(A \cap B)\)

  • Union (not mutually exclusive): \(Pr(A \cup B) = Pr(A) + Pr(B) - Pr(A \cap B)\)

  • Joint probability: \(Pr(A \cap B) = Pr(A) \cdot Pr(B)\), if A and B are independent

  • Independence: \(Pr(A|B)=Pr(A)\) and \(Pr(B|A)=Pr(B)\)

  • Conditional probability: \(Pr(A|B)=\frac{Pr(A \cap B)}{Pr(B)}\)

\(\cup\): read it as “probability of either A or B (or both) occurring”

\(\cap\): read it as “probability of A and B occurring simultaneously”
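These rules are easy to verify numerically. A minimal R sketch with a fair six-sided die (a hypothetical example; `pr` is a small helper defined here, not a base R function):

```r
# Fair six-sided die: each face has probability 1/6
pr <- function(event) length(event) / 6
A <- c(2, 4, 6)   # "even number"
B <- c(5, 6)      # "five or six"

pr(union(A, B))                        # Pr(A U B)
pr(A) + pr(B) - pr(intersect(A, B))    # union rule: same value
pr(intersect(A, B)) / pr(B)            # Pr(A|B), conditional probability
```

Here \(Pr(A|B) = (1/6)/(2/6) = 1/2 = Pr(A)\), so these two events happen to be independent.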

Probability rules

Note that, under independence between A and B:

\(Pr(A|B)=\frac{Pr(A \cap B)}{Pr(B)}=Pr(A)\\ \Rightarrow Pr(A)\cdot Pr(B)=Pr(A \cap B)\)


While, under lack of independence between A and B:

\(Pr(A|B)=\frac{Pr(A \cap B)}{Pr(B)}\\ \Rightarrow Pr(A|B)\cdot Pr(B)=Pr(A \cap B)\)

A note on pdf vs. pmf

Discrete measures 👉 probability

Continuous measures 👉 density


  • The probability for any specific value of a continuous measure is \(0\)

  • Densities are related to (but not exactly the same as) probabilities

  • Both still obey the probability rules: pdf(s) integrate to 1; pmf(s) sum to 1
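Both facts are easy to check in R (a standard normal and a Binomial(10, 0.3) as arbitrary examples):

```r
# pdfs integrate to 1 over their support
integrate(dnorm, lower = -Inf, upper = Inf)$value   # ~1
# pmfs sum to 1 over their support
sum(dbinom(0:10, size = 10, prob = 0.3))            # 1
# a density is not a probability: it can exceed 1
dnorm(0, mean = 0, sd = 0.1)                        # ~3.99
```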

A note on cumulative distribution function(s)

Cdf(s) map values of a measure to the probability of the measure assuming that value or a lower one

Usually written as: \(F(x) = Pr(X \leq x)\)

Cdf(s) exist for both continuous and discrete measures

A note on quantiles

Quantiles are values assumed by a measure that split its pdf (or pmf) into intervals of given probability

Example: percentiles split a probability distribution into 100 intervals of equal probability

Example: the median is the 2nd quartile
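For instance, the quartiles of a standard normal, both exact and estimated from a (hypothetical) random sample:

```r
# Quartiles split the distribution into four intervals of equal probability
qnorm(c(0.25, 0.50, 0.75))   # 2nd quartile = median = 0
# empirical analogue on a random sample
set.seed(1)
quantile(rnorm(1e5), probs = c(0.25, 0.5, 0.75))
```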

R makes it easy

The Fantastic 4

d*, p*, q*, r*

d*: compute density (cont.) or probability (discr.)

p*: returns \(Pr(measure\leq quantile)\) (mind the tail argument)

q*: returns quantile for a given p* (mind the tail argument)

r*: draw random values of measures from a model

Examples:

Gaussian: dnorm, pnorm, qnorm, rnorm

Binomial: dbinom, pbinom, qbinom, rbinom
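A minimal tour of the four prefixes for the Gaussian and the binomial:

```r
dnorm(0)                          # density of N(0,1) at 0 (~0.399)
pnorm(1.96)                       # Pr(X <= 1.96) ~ 0.975
pnorm(1.96, lower.tail = FALSE)   # mind the tail argument: ~0.025
qnorm(0.975)                      # inverts pnorm: ~1.96
dbinom(3, size = 10, prob = 0.5)  # Pr(X = 3) for a Binomial(10, 0.5)
set.seed(42)
rnorm(3, mean = 5000, sd = 300)   # three random draws from the model
```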

Data, models, probability


Data: information we have available

Model: a set of assumptions to describe a simplified version of reality

Parametric model: Model described by parameters (see pdfs and pmfs)

Probability: how measures behave according to our model

Likelihood and ML estimation

We have data and models, what do we do now?

Let’s use data to estimate model parameters!


ID BodyMass
1 5085.467
2 4983.132
3 4384.706
4 4773.966
5 5224.501
6 5272.518
7 4467.005
8 4892.681

Assumption: Body mass of (all existing) Gentoo penguins is normally distributed with some mean and variance

Parametric model: \(Gentoo\hspace{1 mm}body\hspace{1 mm}size \sim \mathcal{N}(\mu,\, \sigma^{2})\)

Likelihood function

  • \(Probability(BodyMass = 3000) \rightarrow Pr(BM_i = value_i)\)
  • \(Pr(BM_1 = value_1) \times Pr(BM_2 = value_2) \hspace{1 mm} \times \hspace{1 mm} ... \hspace{1 mm}\times \hspace{1 mm} Pr(BM_n = value_n)\)
  • \(\prod\limits_{i=1}^{n} Pr(BM_i = value_i)\)
  • Likelihood: \(L = \prod\limits_{i=1}^{n} Pr(BM_i = value_i|\mu,\sigma^2)\), with \(Pr\) being the Gaussian pdf

Maximizing the joint probability of the data given the parameters means finding the parameter value(s) under which the observed data are most likely (under the assumed model)!


Likelihood(parameters|data) = Probability(data|parameters)

Maximum likelihood estimation

Link data, model and L

Data: sample of \(n\) penguins on which we measure BM

Model:

\(BM \sim \mathcal{N}(\mu,\,\sigma^{2})\\\)

Probability (density) for \(BM_i\):

\(f(BM_i) = \frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{BM_i-\mu}{\sigma}\right)^2}\)

L (given model):

\(\prod\limits_{i=1}^{n} \frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{BM_i-\mu}{\sigma}\right)^2}\)

“Move” along combinations of \(\mu\) and \(\sigma^2\) and find those that maximize L
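A brute-force sketch of this search, using the eight body-mass values from the table above (the grid ranges and step are arbitrary choices):

```r
bm <- c(5085.467, 4983.132, 4384.706, 4773.966,
        5224.501, 5272.518, 4467.005, 4892.681)
# evaluate the log-likelihood at every (mu, sigma) combination on a grid
grid <- expand.grid(mu = seq(4000, 5500, by = 5),
                    sigma = seq(100, 700, by = 5))
grid$ll <- mapply(function(m, s) sum(dnorm(bm, mean = m, sd = s, log = TRUE)),
                  grid$mu, grid$sigma)
grid[which.max(grid$ll), ]   # best (mu, sigma) pair on the grid
mean(bm)                     # the analytical MLE of mu, for comparison
```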

Maximum likelihood estimation

We usually maximize the log-Likelihood (LL) for two main reasons:

  • Products become sums:

\(log(\prod\limits_{i=1}^{n}X_i) = \sum\limits_{i=1}^{n}log(X_i)\)

  • Easier to work with exponential functions (which appear in many pdfs and pmfs): the log cancels the exponential
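In R, `dnorm(..., log = TRUE)` works directly on the log scale, so the product becomes a sum (toy numbers):

```r
x <- c(4.9, 5.1, 5.0)
prod(dnorm(x, mean = 5, sd = 0.1))             # likelihood (a product)
log(prod(dnorm(x, mean = 5, sd = 0.1)))        # its log
sum(dnorm(x, mean = 5, sd = 0.1, log = TRUE))  # same number, computed stably
```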

Examples

We will:

  • Estimate the population mean of Gentoo’s body mass using brute force

  • Estimate regression parameters for the relationship between Gentoo’s body mass and flipper length (without using brute force)

  • Estimate the rate parameter of a Poisson population

NOW GO TO R..

What I think I am doing


Model:

\(\mu_i = \alpha + \beta \cdot flipper\hspace{1mm}length_i\)

What I am actually doing

Model:

\(Gentoo\hspace{1 mm}body\hspace{1 mm}size_i \sim \mathcal{N}(\mu_i,\, \sigma^{2})\)

\(\mu_i = \alpha + \beta \cdot flipper\hspace{1mm}length_i\)
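This is what `lm()` estimates; a sketch with `optim()` minimizing the negative log-likelihood of exactly this model (body masses come from the earlier table, while the flipper lengths are made-up values for illustration):

```r
bm      <- c(5085.467, 4983.132, 4384.706, 4773.966,
             5224.501, 5272.518, 4467.005, 4892.681)
flipper <- c(217, 230, 210, 218, 222, 225, 213, 215)  # hypothetical values
fc <- flipper - mean(flipper)                         # centred predictor

# negative log-likelihood of BM ~ N(alpha + beta * fc, sigma^2)
nll <- function(par) -sum(dnorm(bm, mean = par[1] + par[2] * fc,
                                sd = exp(par[3]), log = TRUE))
fit <- optim(c(mean(bm), 0, log(sd(bm))), nll,
             control = list(maxit = 5000, reltol = 1e-12))
fit$par[1:2]           # alpha and beta
coef(lm(bm ~ fc))      # least squares recovers the same estimates
```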

Poisson

\(Y \sim Pois(\lambda)\), with \(Y\) assuming values \(\geq 0\)

Pmf: \(Pr(Y = y) = \frac{\lambda^y e^{-\lambda}}{y!}\)


  • \(Mean = variance\)
  • Limiting case of the binomial with \(N\) large and \(p\) small
  • Used to model counts with no known upper bound
  • Converges to Gaussian as \(\lambda\) gets large
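These properties are easy to verify in R (\(\lambda = 2\) is an arbitrary choice):

```r
lambda <- 2; y <- 0:3
dpois(y, lambda)                          # Pr(Y = y)
lambda^y * exp(-lambda) / factorial(y)    # same values, from the pmf formula
# binomial with N large, p small and N*p = lambda is close to Poisson
dbinom(3, size = 10000, prob = lambda / 10000)
dpois(3, lambda)
```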

Things to keep in mind

  • Likelihood function \(\neq\) Pdf

  • We found the MLE(s). Does this mean that we now know the population parameters? NO!

  • MLE(s) have asymptotic properties (sample size matters)

  • From the shape (curvature) of the LL around its maximum, we can gauge how precisely we estimate population parameters

Bayes’ rule and Bayesian stats

Bayes’ rule

A re-arrangement of conditional probability:

Conditional probability: \(Pr(A|B)=\frac{Pr(A \cap B)}{Pr(B)}\)

  • \(Pr(A|B)=\frac{Pr(A \cap B)}{Pr(B)}\)

  • \(Pr(A|B)Pr(B)=Pr(A \cap B)\)

  • But \(Pr(A \cap B) = Pr(B \cap A)\)

  • And \(Pr(B \cap A) = Pr(B|A)Pr(A)\)

  • So \(Pr(A|B)Pr(B) = Pr(B|A)Pr(A)\)

  • Dividing both sides of the equation by \(Pr(B)\), we end up with:

Bayes’ rule: \(Pr(A|B) = \frac{Pr(B|A)Pr(A)}{Pr(B)}\)
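A classic numeric sketch with hypothetical diagnostic-test numbers, where A = “has the disease” and B = “tests positive”:

```r
pr_A      <- 0.01  # prior Pr(A): prevalence
pr_B_A    <- 0.95  # Pr(B|A): sensitivity
pr_B_notA <- 0.05  # Pr(B|not A): false-positive rate
# marginal Pr(B), via the law of total probability
pr_B <- pr_B_A * pr_A + pr_B_notA * (1 - pr_A)
pr_B_A * pr_A / pr_B   # posterior Pr(A|B): ~0.16, despite the accurate test
```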

Wait..what does it mean? (Pt. 1)

\(Pr\): we are familiar with it (pdfs, pmfs)


\(Pr(B|A)\): what if I tell you that \(B\) is data and \(A\) model parameters?


YES! Pr(B|A) IS THE LIKELIHOOD!


\(Pr(A)\): prior..a model for the parameter(s) 🤯


\(Pr(B)\): marginal probability of the data

Wait..what does it mean? (Pt. 2)

We gave a name to all ingredients for \(Pr(A|B)\), but what’s \(Pr(A|B)\)?


Recall our aim is to estimate model parameters


This time we won’t restrict ourselves to a unique idea of the data-generating process (DGP); rather, we’ll consider different plausible DGPs


The plausibility of each of these scenarios results from combining what the data suggest about the model (the likelihood) with what we assume about the model before even looking at the data (the prior):

\(Pr(B|A)\cdot Pr(A)\)

A closer look at Pr(B)

Normalization constant: makes \(Pr(A|B)\) integrate to 1


\(Pr(B)\): marginal probability of the data

\(Pr(B) = \sum\limits_{i=1}^{n}Pr(B|A_i)Pr(A_i)\), from the law of total probability (LTP)

\(\sum\limits_{i=1}^{n}Pr(B|A_i)Pr(A_i) = Pr(B|A_1)Pr(A_1) + ... + Pr(B|A_n)Pr(A_n)\)


For any \(A_i\): \(Pr(A_i|B) = \frac{Pr(B|A_i)Pr(A_i)}{Pr(B|A_1)Pr(A_1) + ... + Pr(B|A_n)Pr(A_n)}\)
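A discrete sketch with three candidate “models” \(A_i\): a coin with bias \(p \in \{0.2, 0.5, 0.8\}\), a uniform prior, and data B = 7 heads in 10 tosses (all numbers are arbitrary):

```r
p_cand <- c(0.2, 0.5, 0.8)                     # the candidate models A_i
prior  <- rep(1/3, 3)                          # Pr(A_i)
lik    <- dbinom(7, size = 10, prob = p_cand)  # Pr(B|A_i)
marg   <- sum(lik * prior)                     # Pr(B): law of total probability
post   <- lik * prior / marg                   # Pr(A_i|B)
post                                           # mass moves toward p = 0.8
sum(post)                                      # 1: posteriors sum to one
```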

What do we do with Pr(A|B)

Look at the MAP and intervals (we have a full distribution to explore!)

Why choose Bayes

  • Not to limit ourselves to a unique perspective on the DGP

  • Likelihood provides one and only one winner (the MLE)

  • Probably better suited for ecology & observational studies (?) - nature is complex

  • ‘Frequentist’ approach for experiments?

  • Both freq. and bayesian approaches are useful

Estimating parameters

  • Grid approximation

  • Quadratic approximation

  • MCMC 😎

Grid approximation

  • Same as brute force maximum likelihood estimation

  • This time, we compute the posterior at each candidate value of the parameter

  • Importance of grid resolution!
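A minimal grid approximation for a binomial proportion (7 successes in 10 trials, flat prior; the numbers are arbitrary):

```r
p_grid <- seq(0, 1, length.out = 1000)          # grid resolution matters!
prior  <- rep(1, length(p_grid))                # flat prior
lik    <- dbinom(7, size = 10, prob = p_grid)   # likelihood at each grid value
post   <- lik * prior
post   <- post / sum(post)                      # normalize to sum to 1
p_grid[which.max(post)]                         # MAP, ~0.7
```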

NOW GO TO R..

Quadratic approximation

Assumption: the posterior is Gaussian (near the peak)!

\(log\) of a Gaussian is a perfect parabola

1st derivative of the parabola (set to zero) 👉 gives us its peak

2nd derivative of the parabola 👉 gives us its curvature
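A sketch with `optim()`: minimize the negative log-posterior (flat prior, so here it equals the negative log-likelihood of 7 successes in 10 trials) and read the curvature off the numerical Hessian:

```r
nlp <- function(p) -dbinom(7, size = 10, prob = p, log = TRUE)
fit <- optim(0.5, nlp, method = "L-BFGS-B",
             lower = 0.01, upper = 0.99, hessian = TRUE)
fit$par                  # the peak (MAP), ~0.7
sqrt(1 / fit$hessian)    # curvature -> sd of the Gaussian approximation
```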

MCMC

  • Non-parametric means of sampling (and describing the shape of) the posterior

  • Several algorithms exist (Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo)

  • Let’s have a look at what they do here

Books!